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Abstract 



By developing and applying a broad framework for rejection sampling using 
auxiliary randomness, we provide an extension of the perfect sampling algorithm of 
Fill (1998) to general chains on quite general state spaces, and describe how use of 
bounding processes can ease computational burden. Along the way, we unearth 
a simple connection between the Coupling From The Past (CFTP) algorithm 
originated by Propp and Wilson (1996) and our extension of Fill's algorithm. 
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1 Introduction 



Markov chain Monte Carlo (MCMC) methods have become extremely popu- 
lar for Bayesian inference problems (see e.g. Gelfand and Smith Smith and 
Roberts Tierney Gilks et al. fl^), and for problems in other areas, such 

[0 



or 



as spatial statistics, statistical physics, and computer science (see e.g. Fill 
Propp and Wilson for pointers to the literature) as a way of sampling ap- 
proximately from a complicated unknown probability distribution vr. An MCMC 
algorithm constructs a Markov chain with one-step transition kernel K and sta- 
tionary distribution tt; if the chain is run long enough, then under reasonably 
weak conditions (cf. Tierney [^) it will converge in distribution to vr, facilitating 
approximate sampling. 

One difficulty with these methods is that it is difficult to assess convergence to 
stationarity. This necessitates the use of difficult theoretical analysis (e.g., Meyn 
and Tweedie [^, Rosenthal [^) or problematic convergence diagnostics (Cowles 
and Carlin [|^, Brooks, et al. 0) to draw reliable samples and do proper inference. 

An interesting alternative algorithm, called coupling from the past (CFTP), 
was introduced by Propp and Wilson (see also |]38[ and 



studied and used by a number of authors (including Kendall [^], M0ller 



and has been 
Mur- 



doch and Green Foss and Tweedie Kendall and Thonnes Corcoran 
and Tweedie |^ 
and Rosenthal 



Kendall and M0ller [^, Green and Murdoch ||l8l, and Murdoch 
J^]). By searching backwards in time until paths from all starting 
states have coalesced, this algorithm uses the Markov kernel K to sample exactly 
from TT. 

Another method of perfect simulation, for finite-state stochastically monotone 
chains, was proposed by Fill [|l^]. Fill's algorithm is a form of rejection sampling. 



This algorithm was later extended by M0ller and Schladitz |]3^ and Thonnes |^ 



to non-finite chains, motivated by applications to spatial point processes. Fill's 
algorithm has the advantage over CFTP of removing the correlation between the 
length of the run and the returned value, which eliminates bias introduced by an 
impatient user or a system crash and so is "interruptible" . However, it has been 
used only for stochastically monotone chains, making heavy use of the ordering of 
state space elements. In his paper. Fill indicated that his algorithm could be 
suitably modified to allow for the treatment of "anti-monotone" chains and (see his 
Section 11.2) indeed to generic chains. A valuable background resource on perfect 
sampling methods is the annotated bibliography maintained by Wilson 

The goal of the present paper is to discuss the modifications to Fill's algorithm 
needed to apply it to generic chains, on general (not necessarily finite) state spaces. 
Our basic algorithm is presented in Section |^ as Algorithm |2.1| . An infinite- 
time-window version of Algorithm ^?T] (namely. Algorithm |3.1| ) is presented in 
Section ^. A simple illustrative example is presented in Section ^, while rigorous 
mathematical details are presented in Section 0. 
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In Section |^ we discuss how the computational burden of tracking all of the 
trajectories in Algorithm ^]T] can be eased by the use of coalescence detection 
events in general and bounding processes in particular; these processes take on a 
very simple form (see Section |7.1| ) when the state space is partially ordered and 
the transition rule employed is monotone. A weaker form of monotonicity is also 
handled in Section ^ 

In Section ^ we compare Algorithm |2.1| and CFTP. We also present a simple 



connection between CFTP and Algorithm 3.1. Finally, in Section we discuss 



the perfect generation of samples from vr of size larger than 1. 

We hope that our extension of Fill's algorithm will stimulate further research 
into this less-used alternative for perfect MCMC simulation. 

Notational note: Throughout the paper, we adopt the probabilist's usual short- 
hand of writing {X G B} for the event {uj E Q : X{uj) G B} when X is a random 
element defined on a sample space Q. 



2 The algorithm in brief 

We assume here that our Markov chain may be written in the stochastic 
recursive sequence form 

(2.1) X, = 0(X,_i,U,) 

where (Ug) is an i.i.d. sequence having distribution (say) fi. 

Omitting technical details, our interruptible algorithm for generic chains is 
conceptually quite simple and proceeds as follows. 

Algorithm 2.1 Choose and fix a positive integer t, choose an initial state X^, and 
perform the following routine. Run the time-reversed chain K for t steps [see (|6.5| ) 
for the formal definition of K], obtaining Xt_i, . . . ,Xo in succession. Then (con- 
ditionally given Xo, . . . , Xj) generate Ui, . . . , Ut independently, with chosen 
from its conditional distribution given ( p.l| ) (s = l,...,t). Then, for each ele- 
ment X of the state space X, compute chains (Yo(x), . . . ,Yt{x)), with Yo(x) := x 
and Y,(x) := 0(Y,_i(x), U,) for s = 1, 2, . . . , t. Note that Y,(Xo) = X, for 
s = 1, . . . ,t. Finally, check whether all the values Yf(x), x E X, agree (in which 
case, of course, they all equal X^). If they do, we call this coalescence, and the 
routine succeeds and reports W := Xq as an observation from tt. If not, then 
the routine fails; we then start the routine again with an independent simulation 
(perhaps with a fresh choice of t and Xt), and repeat until the algorithm succeeds. 



Remark 2.2 The algorithm works for vr-almost every deterministic choice of initial 
state X^. Alternatively, the algorithm works provided one chooses X^ from any 
distribution absolutely continuous with respect to tt. See also Remark |0|(c). 
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Here is the basic idea of why the algorithm works correctly. Imagine (1) start- 
ing the construction with Xq ~ n and independently (2) simulating Ui, . . . , U*. 
Determination of coalescence and the value of the coalesced paths at time t each 
rely only on the second piece of randomness. It follows that, conditionally given 
coalescence, Xq and X^ are independent. Hence, conditionally given coalescence 
and X(, we will still have Xq ~ tt, as desired. The algorithm constructs the ran- 
dom variables in a different order, but conditional on coalescence and the value 
of X(, the joint distributions are the same. 



Remark 2.3 (a) Note that no assumption is made in Algorithm |2.1| concerning 
monotonicity or discreteness of the state space. 



(b) This algorithm is, like Fill's original algorithm [jrT|, a form of rejection 
sampling (see, e.g., Devroye [^). This is explained in Section 2 of fl^ . 

(c) We have reversed the direction of time, and the roles of the kernels K 
and K, compared to Fill ||TI] . 



(d) Algorithm |2.1| is interruptible, in the sense of Fill []TT . 

(e) Fill's original algorithm also incorporated a search for a good value 
of t by doubling the previous value of t until the first success. For the most part, 
we shall not address such issues, instead leaving the choice of t entirely up to the 
user; but see Section |^. 

In Section 0, we will carefully discuss the underlying assumptions for Algo- 



rithm |2.1| and the details of its implementation, and also establish rigorously that 
the algorithm works as desired. This will be done by first developing, and then 
applying, results in a rather more general framework. 



3 A modified algorithm which searches for t 

Thus far we have been somewhat sketchy about the choice (s) of t in Algo- 



rithm |2rT|. As discussed in Remark |2.3| (e), one possibility is to run the repetitions 
of the basic routine independently, doubling t at each stage. However, another 
possibility is to continue back in time, reusing the already imputed values and 
checking again for coalescence. (There is an oblique reference to this alternative 
in Remark 9.3 of Fill This idea leads to the following algorithm. 



Algorithm 3.1 Choose an initial state Xq ~ vr, where vr is absolutely continu- 
ous with respect to vr. Run the time-reversed chain K, obtaining Xo,X_i, ... in 
succession. Then (conditionally given Xq, . . . , X^) generate Uq, U_i, . . . indepen- 
dently, with Us chosen from its conditional distribution given ( |2.1| ) {s = 0, —1, . . .). 
For t = 0,1, . . . and x G A", set YLi*^(x) := x and, inductively, 

Yi~'\x) := 0(Yt?(x),U,), -t + l<s<0. 
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If T < oo is the smallest t such that 

(3.1) Yq ^\x), X G A", all agree (and hence all equal Xq), 

then the algorithm succeeds and reports W := X_t as an observation from vr. If 
there is no such T, then the algorithm fails. 

Here is the basic idea of why the algorithm works correctly. Imagine (1) start- 
ing the construction with W ~ vr, and, independently, (2) simulating Uq, U^i, . . . 
[and then, after determining T, setting X_t '■= W and X^ := 0(Xs_i,Us) for 
s = — T + 1, . . . , 0]. Determination of T and the value of Xq each rely only on the 
second piece of randomness. It follows that, conditionally given coalescence, X_t 
and Xq are independent. Hence, conditionally given coalescence and Xq, we will 
still have X_t ~ vr, as desired. As before, the algorithm constructs the random 
variables in a different order, but conditional on coalescence and the value of Xt, 
the joint distributions are the same. 

Remark 3.2 (a) We need only generate Xo,X_i, . . . ,X_t and then impute Uq, 
U_i, . . . ,\J-t+i in order to check whether or not ( |3.1| ) holds. Thus, as long as 
T < oo, the algorithm runs in finite time. 



(b) One can formulate the algorithm rigorously in the fashion of Section 3^ 
and verify that it works properly. We omit the details. 



(c) Algorithm p.l| is also interruptible: specifically, T and W are conditionally 
independent given success. 

(d) See also the discussion of a "doubling" search strategy in Section ^ below. 

4 A simple illustrative example 



We illustrate Algorithm |2.1| for a very simple example (for which direct sam- 
pling from TT would be elementary, of course) and two different choices of transition 
rule. Consider the discrete state space X = {0, 1, 2}, and let vr be uniform on X. 
Let K correspond to simple symmetric random walk with holding probability 1/2 
at the endpoints; that is, putting k{x,y) := K{x, {y}), 

k{0, 0) = A;(0, 1) = k{l, 0) = k{l, 2) = k{2, 1) = k{2, 2) = 1/2, 
k{0,2) = k{l,l) = k{2,0) = 0. 

The stationary distribution is vr. As for any ergodic birth-and-death chain, K is 
reversible with respect to tt, i.e., K = K. Before starting the algorithm, choose a 
transition rule; this is discussed further below. 

For utter simplicity of description, we choose t = 2 and (deterministically) 
Xt = (say); as discussed near the end of Section a deterministic start is 
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permissible here. We then choose Xi ~ -^'(0, ■) and Xq | Xi ~ -^'(Xi, ■). How we 
proceed from this juncture depends on what we chose (in advance) for 0. 

One choice is the independent-transitions rule [discussed further in Remarks 
|6.7| (c) and |6.1l| (b) below]. The algorithm's routine can then be run using 6 in- 
dependent random bits: these decide Xi (given X2), Xq (given Xi), and the 4 
transitions in the second (forward) phase of the routine not already determined 
from the rule 

Xs_i 1-^ X^ from time s — 1 to time s (s = 1, 2). 

There are thus a total of 2^ = 64 possible overall simulation results, each having 
probability 1/64. We check that exactly 12 of these produce coalescence. Of 
these 12 accepted results, exactly 4 have Xq = 0, another 4 have Xq = 1, and 
a final 4 have Xq = 2. Thus P(C) = 12/64 = 3/16, and we confirm that 
£(Xo|C) = vr, so the algorithm is working correctly. (An identical result holds if 
we had instead chosen Xt = 1 or X^ = 2.) 

An alternative choice adapts Remarks |6.7| (b) and |6.11| (c) to the discrete setting 
of our present example. We set 



0(-,m) 



the mapping taking 0, 1, 2 to 0, 0, 1, respectively, if m = 
the mapping taking 0, 1, 2 to 1, 2, 2, respectively, if m = 1 



where u is uniform on {0, 1}. Choosing t = 2 and X^ = as before, the algorithm 
can now be run with just 2 random bits. In this case we check that exactly 3 of 
the 4 possible simulation results produce coalescence, 1 each yielding Xq = 0, 1, 2. 
Note that P(C) = 3/4 is much larger for this choice of 0. In fact, since is a 
monotone transition rule [see Definition 4.2 in Fill |ll|] or (|7.1|) below], for the 
choice Xj = it gives the highest possible value of P(C) among all choices of 0: 
see Remark 9.3(e) in Fill fll]]. It also is a best choice when Xt = 2. [On a 



minor negative note, we observe that P(C) = for the choice Xt = 1. Also note 
that the vr- average of the acceptance probabilities (3/4, 0, 3/4), namely, 1/2, is the 
probability that forward coupling (or CFTP) done with the same transition rule 
gives coalescence within 2 time units; this corroborates Remark |6.9Kc) below.] 



Remark 4.1 Both choices of are easily extended to handle simple symmetric 
random walk on {0, . . . ,n} for any n. If Xj = 0, then the second (monotone) 
choice is again best possible. For fixed c G (0, 00) and large n, results in Fill [|ll| 
and Section 4 of Diaconis and Fill |^ imply that, for t = en? and X^ = 0, the 
routine's success probability is approximately p(c); here p(c) increases smoothly 
from to 1 as c increases from to 00. We have not attempted the corresponding 
asymptotic analysis for the independent-transitions rule. 
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5 Conservative detection of coalescence 

{Note: In order to focus on the main ideas, in this Section]^ — as in the previous 
sections — we will suppress measure-theoretic and other technical details. In an 
earlier draft we provided these details; we trust that the interested reader will be 



able to do the same, using the rigorous treatment of Algorithm in Section y as 
a guide. Also note that the terminology used in this section — detection process, 
bounding process, etc. — has varied somewhat in the perfect sampling literature.) 

Even for large finite state spaces X, determining exactly whether or not co- 



alescence occurs in Algorithm can be prohibitively expensive computation- 
ally; indeed, in principle this requires tracking each of the trajectories Y{x) : = 
(Yq{x), . . . ,Yt{x)), X G X, to completion. 

However, suppose that E is an event which is a subset of the coalescence event. 
That is, whenever E occurs, coalescence occurs (but perhaps not conversely). 
Assume further that E, like the coalescence event, is an event whose occurrence 



[or not) is determined solely by Ui, . . . , \Jt- Then Algorithm |2.1| remains valid 



if, instead of accepting W whenever coalescence occurs, we accept only when 



the event E occurs. Indeed, the explanation for why Algorithm |2.1| works goes 
through without change in this case. 



It follows that, when implementing Algorithm 2A, it is permissible to use 
conservative detection of coalescence, i.e., to accept W as an observation from vr 
if and only if some event E occurs, provided that E is a subset of the coalescence 
event and is a function of Ui, . . . , only. We call such an event E a coalescence 
detection event. 

Similar considerations apply to Algorithm Indeed, let T be as in that 
algorithm. Let T' be any other positive-integer-valued random variable, which 
is completely determined by Uo,U_i,... and is such that T' > T. Then Al- 
gorithm |3.1| remains valid if we replace T by T', i.e., if we report W := X_x' 



instead of reporting X_t. Indeed, the explanation for why Algorithm |3]1| works 
goes through without change in this case. 



For example, we might choose in Algorithm |3.1| to let T' be the smallest t 



which is a power of 2 such that (|3.1| ) holds and report X_t' instead. This has the 
computational efficiency advantage (similar to that of CFTP) that we can use a 
"doubling" search strategy, trying t = 1, 2, 4, 8, ... in succession, until we find the 
first time t (= T') such that holds. 

In such and in other cases discussed below, conservative coalescence 

detection will often lead to easier and faster application of our algorithms. Thus, 
such detection may be of considerable help in practical applications. 



7 



5.1 Detection processes 

In practice, a coalescence detection event is constructed in terms of a detec- 
tion process. What we mean by this is a stochastic process D = (Dq, . . . ,Dj), 
defined on the same probabihty space as U = (Ui, . . . , Ut) and X = (Xq, . . . , X^), 
together with a subset A of its state space V, such that 

(a) D is constructed solely from U, and 

(b) {Ds G A for some s <t} C {Yt{x) does not depend on x}. 
Then E := {D^ G A for some s < t} is a coalescence detection event. 

Remark 5.1 In practice, D usually evolves Markovianly using U; more precisely, 
it is typically the case that there exists deterministic do E V such that Dq = do 
and D has the stochastic recursive sequence form [paralleling ( |2.1| )] 

D, :=(5(D,_i,U,), l<s<t. 

— * 

The important consequence is that, having determined the trajectory X and 
the imputed U, the user need only follow a single trajectory in the forward phase 
of the routine, namely, that of D. 



Example 5.2 We sketch two illustrative examples of the use of detection processes 
that do not immediately fall (see Remark |5.3| below) into the more specific settings 
of Sections ^]2|or|7]^. We hasten to point out, however, that because of the highly 
special structure of these two examples, efficient implementation of Algorithm |2.1| 
avoids the use of the forward phase altogether; this is discussed for example (a) 
in Fill PI. 

(a) Our first example is provided by the move-to-front (MTF) rule studied 
in |12[. Let K be the Markov kernel corresponding to MTF with independent and 
identically distributed record requests corresponding to probability weight vector 
{wi, . . . ,Wn)', see (2.1) of |12[ for specifics. The arguments of Section 4 of |T2| 



show that if is taken to be the set of all records requested at least once among 
the first s requests and A is taken to consist of all (n — l)-element subsets of 
the records 1, . . . ,n, then D is a detection process. Similar detection processes 
can be built for the following generalizations of MTF: move-to-root for binary 
search trees (see Dobrow and Fill ||10|) and MTF-like shuffles of hyperplane 
arrangement chambers and more general structures (see Bidigare, et al. |^ and 
Brown and Diaconis [Q). 

(b) A second example of quite similar spirit is provided by the (now well- 
known) Markov chain (X^) for generating a random spanning arborescence of the 
underlying weighted directed graph, with vertex set U, of a Markov chain (Ut) 
with state space U and kernel q. Consult Propp and Wilson |^ (who also discuss 
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a more efficient "cycle-popping" algorithm) for details. We consider here only the 
special case that (\Jt) is an i.i.d. sequence, i.e., that q{v,w) = q{w). A transition 
rule (f) for the chain (X^) is created as follows: for vertex u and arborescence x 
with root r, 0(a;, u) is the arborescence obtained from x by adding an arc from r 
to u and deleting the unique arc in x whose tail is u. Then it can be shown that 
if Ds is taken to be the set of all vertices appearing at least once in (Ui, . . . , U^) 
and A := {W}, then D is a detection process. 



5.2 Bounding processes 

We obtain a natural and useful example of a detection process D when (a) D is 
constructed solely from U, (b) the corresponding state space V is some collection 
of subsets of X , with 

A := {{z} : zeX}, 

and 

(c) D, D {Y,(a;) : x e X}. 

The concept is simple: in this case, each set is just a "conservative estimate" 
(i.e., a superset) of the corresponding set {Ys{x) : x G X} of trajectory values; 
thus if Ds = {z}, then the trajectories Y(x) are coalesced to state z at time s 
and remain coalesced thereafter. We follow the natural impulse to call such a 
set- valued detection process a hounding process. Such bounding processes arise 
naturally in the contexts of monotone (and anti-monotone) transition rules, and 
have been used by many authors: see Section ffTIl . Other examples of bounding 
processes can be found in Haggstrom and Nelander (for CFTP) and in works 
of Huber: see [0] and in connection with CFTP and p4| in connection with 



our algorithm. 

Of course, nothing is gained, in comparison to tracking all the trajectories, by 
the use of a bounding process unless the states of V have more concise representa- 
tions than those of generic subsets of X\ after all, we could always choose V = 2"^ 
and Ds = {Ys{x) : x G X}. One rather general, and frequently applied, setting 
where compact representations are possible, discussed in Section fTTI , is that of a 
realizably monotone chain on a partially ordered set (poset) X . See the references 
listed in Remark |7^(a) for examples of the use of bounding processes for such 
chains. 

See also Remark |T|^(b) for a brief discussion of the closely related concept of 
"dominating process." 

Remark 5.3 With our algorithm, one can take advantage of detection processes 
that are more general than bounding processes. Consider Example ^[^(a) as an 
illustrative case. Suppose we start the ffist (i.e., time-reversed) phase of our algo- 
rithm in the identity permutation (as is done in [|12]). When we run the forward- 
time phase of the algorithm, we only need to keep track of the unordered set of 
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records requested; this is what the detection process D of Example P3(a) does. If 



the event E of Section |5.1| occurs, that is, if all, or all but one, of the records have 
been requested at least once by time t, then we have detected coalescence. Then, 
without further work, we know the state to which the trajectories have coalesced: 
namely, our initial identity permutation. To maintain a bounding process for this 
same example, we would have to do more work: we would have to keep track of 
the ordered set of records requested, ordered by most recent request. (Note that a 
bounding process could be constructed by combining information from the dom- 
inating process and the trajectory generated in the algorithm's initial backward 
phase. But it is not useful to do so.) 



6 A mathematically rigorous framework 

6.1 The formal framework 

We now formally set up a general framework for rejection sampling using 
auxiliary randomness, paying careful attention to technical details. We will apply 
this framework not only to provide a rigorous treatment of Algorithm |2.1| (in 
Section |6.2|) , but also (independently) to handle a variant (Algorithm |7.2| ) that 
applies to a stochastically monotone kernel K. 

We need to set up two probability spaces. In our main application to Algo- 



rithm pyTl , the first space (fi. A, P) — designated in ordinary typeface — will be use- 
ful for theoretical considerations and for the computation of certain conditional 
probability distributions. The second space (f2. A, P) — designated in boldface 
type — will be the probability space actually simulated when the algorithm is run. 
All random variables defined on the first space (respectively, second space) will 
also be designated in ordinary typeface (resp., boldface type). We have chosen this 
notational system to aid the reader: Corresponding variables, such as Xq and Xq, 
will play analogous roles in the two spaces. 

Recall that for a measurable space {Z, J-') and a (not necessarily measurable) 
subset E C Z, the trace a-field (on the sample space E) is the a-field {E fl E : 
E G J-'}. In setting up both probability spaces, we assume: 

(i) {X, B) and (A", B') are measurable spaces, B' C X', and B" is the trace 
a-field for B' 

(ii) TT is a probability measure on B. 

For our first probabihty space, we also assume that 

(iii) {Q, A, P) is a probability space on which are defined mappings X : Q ^ X 
and X' -.n^ X' &ndY -.n^ X' and a set C C 1] 
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(iv) X is measurable from A to B, with C{X) = PX ^ = tt 
and that the following two conditions hold, for every B" G B": 

(v) {X' e B"} nc eAandiY e B"} nc eA 

(vi) P{{X' G B"}nC\X e B) = P{{Y G B"}r\C) forall5 G Swith7r(5) > 0. 

Notice that (v) and (vi) are implied by 

(V) {X' G B"}nC eA, Y = X'on the set C, and X and the event {Y G B"}r\C 
are independent. 

Proposition 6.1 Under assumptions (i)-(vi), 

(6.1) P{{x' G B"} ncn{x eB}) = P{{x' e B"} n C) 7r{B) 

for every B E B and every B" G B" . 

Proof. We may assume vr(i?) > 0, in which case (vi) implies 

(6.2) P{{X' eB"}nCn{x eB}) = P{{Y e B"}nC)n{B), 

(6.3) P{{X' G B"} nC) = P{{Y G B"} n C) 

for general B E B and for i? = X , respectively. Substituting ( |673| ) into ( |6.2| ) 
gives (|Tl|). ■ 

Two corollaries follow immediately: 

Corollary 6.2 Under assumptions (i)-(vi), if P{{X' G -B'} fl C) > 0, then 

C{X I {X' G B'} n C) = TT. 

Corollary 6.3 Suppose assumptions (i)-(vi) hold for B' = X' ; in particular, (v) 
then implies that C E A. Suppose also that X' is measurable from A to B' . 
Assume P{C) > 0. Then 

X' and X are conditionally independent given C, and C{X\C) = n. 

Now we set up the second probability space. Specifically, consider the following 
assumptions: that 

(vii) (ri. A, P) is a probability space on which are defined mappings ^ : Q, —>■ X 
and X.' -.n^ X' and a set C C 
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(viii) X is measurable from A to B, 
that, for every B" G B", we have 

(ix) {X' e B"} n C G A, 

and that we have the following basic connection between the two spaces: 

(x) The measure 

P({X' G dx'} n C n {X G dx}) 

on the product space {B' x X,B" ® B) has a density D{x^x') = D{x') 
{x' G B') that doesn't depend on x E X with respect to the measure 

P{{x'edx'}n Cn{Xedx}). 

Notice that (x) is implied by conditions (x')-(x'"), wherein C(X.') = P(X')^^: 
(x') B' = X', and X' and X' are measurable (from A and A, respectively, to B') 

(x") /:(X') < C{X'), with Radon-Nikodym derivative D 

(x'") There exists a conditional subprobability distribution 

P{Cn{x edx}\x' = x), x'ex', 

which also serves as conditional subprobability distribution 

P(Cn{XGc/x}|x' = x'), x'ex'. 

It is now key that the results of Proposition |6.1| and Corollaries 
carry over to our second space: 

Proposition 6.4 Under assumptions (i)-(x), 

(6.4) P({X' G B"} n C n {X G fi}) = P({X' G B"} n C) 7t{B) 

for every B E B and every B" G B" . 

Proof. Let D be the Radon-Nikodym derivative guaranteed by assumption (x). 
Then, using Proposition |6.1| , 

p({x'G5"}nCn{XGfi}) = / D{x')P{{x' edx'}ncn{x e B}) 

JB" 

= I Dix')Pi{X' edx'}nC)TiiB). 

JB" 

As usual, setting B = X and substituting, we obtain ( |6.4| ). ■ 
Corollary 6.5 Under assumptions (i)-(x), z/P({X' G B'} fl C) > 0, then 

/:(X I {X' G 5'} n C) =7r. 

Corollary 6.6 Suppose assumptions (i)-(x) hold for B' = X' . Suppose also that 
X' is measurable from A to B' . Assume P(C) > 0. Then 

X' and X are conditionally independent given C, and £(X|C) = vr. 



3.2 and O 
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6.2 Details for Algorithm 2.1 



The goal of this subsection is to describe when and how Algorithm |2.1| can be 
applied legitimately. 

The space {X,B): Recall that a Polish space is a complete separable metric 
space. For convenience, we shall assume that the measurable state space {X, B) of 
interest (on which the probability measure vr of interest is defined) is isomorphic 
to a Borel subset of a Polish space (with its trace Borel a- field). This assumption 
will at once guarantee the existence of such objects as conditional distributions 
that would otherwise require individual arguments or assumptions. We call such a 
space a standard Borel space. Our assumption should cover most cases of applied 
interest. 

The kernel K and its time-reversal K : Let : x i5 — > [0, 1] be a Markov 
transition kernel on X; that is, we suppose that K{x, ■) is a probability measure 
on B for each x & X and that K{-,B) is a S-measurable function for each B & B. 
The kernel is chosen (by the user) so that vr is a stationary distribution, i.e., so 
that 

'7i{dx)K{x,dy) = n{dy) on X. 



X 



Since {X, B) is standard Borel, there exists "a conditional distribution the other 
way around" — more precisely, a Markov kernel K on X satisfying 

(6.5) n{dx)K{x,dy) = Ti{dy)K{y,dx) on X X. 

Given vr and K, the kernel K{y,dx) is 7r((i?/)-almost surely uniquely defined. We 
choose and fix such a K. 

The transition rule (p: It can be shown that there exists a transition rule 
which can be used to drive the construction of the Markov chain of interest. More 
precisely, our assumption that {X, B) is standard Borel implies that there exists 
a standard Borel space {U,J-'), a product-measurable function (p : X xU ^ X, 
and a probability measure fi on T , such that 

(6.6) K{x, B) = fi{u : 0(x, u) E B}, x E X, B E B. 

Such (p (with accompanying /x) is sometimes called a transition rule. We choose 
and fix such a (0, /i) . 



Remark 6.7 (a) Conversely, if has the stated properties and K is defined by (|6.6|) , 
then K is a. Markov kernel. 

(b) A transition rule can always be found that uses {U, JF, ^) = ([0, 1], Borels, 
uniform distribution). The proof of existence (cf. Theorem 1.1 in Kifer and 
Remark (iv) at the end of Section 5.2 in Diaconis and Freedman [Q) makes use of 
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inverse probability transforms and certain standard reduction arguments. In the 
special case that {X,B) = ([0, 1], Borels), we can in fact use 

u) = G{x, u) 

where G{x, ■) is the usual inverse probability transform corresponding to the dis- 
tribution function u i— ^ K{x, [0,m]). 

(c) If (A", B) is any discrete space (i.e., if X is countable and B is the total a- 
field), a very simple alternative choice is the following "independent-transitions" 
transition rule. Let lA = X'^ (with T the product cx-algebra), let be product 
measure with xth marginal K{x^ ■) (x G A"), and let be the evaluation function 

0(x, u) := u{x). 

(d) Many interesting examples of transition rules can be found in the literature, 
including Diaconis and Freedman p| and the references cited in Section |I|. 

(e) Usually there is a wealth of choices of transition rule, and the art is to find 
one giving rapid and easily detected coalescence. Without going into details at 
this point, we remark that the transition rule in (c) usually performs quite badly, 
while transition rules having a certain monotonicity property will perform well 
under monotonicity assumptions on K. 

The Markov chain and the first probability space: From our previous comments 
it is now easy to see that there exists a standard Borel space {U, JF) , a transition 
rule (0,yu), and a probability space {fl,A,P) on which are defined independent 
random variables Xq, Ui, U2, ■ ■ ■ ,Ut with Xq ~ tt and each Us ~ fi. Now induc- 
tively define 

(6.7) X, :=0(X,_i,f/,), l<s<t. 

Then X := (Xo,...,Xt) is easily seen to be a stationary Markov chain with 
kernel K, in the sense that 

(6.8) P(Xo G dxo, . . . , Xt G dxt) = n{dxo)K{xo, dxi)- ■ -K^Xt-i, dxt) on X^^^. 

In fact, for each x G A" we obtain a chain with kernel K started from x by defining 
Yq{x) := X and, inductively, 

Ys{x) :=0(n_i(a;),f/,). 

Let Y{x) := {Yq{x), . . . ^Yt{x)). In this notation we have Y{Xq) = X. Recalling 
the notational note at the end of Section |l|, let 

(6.9) C := {Yt{x) does not depend on x} 

denote the set of sample points u for which the trajectories Y{x) have all coalesced 
by time t. We assume that C belongs to the cr-field ^{U) generated hj U : = 
(f/i,...,f/0- 
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Remark 6.8 For this remark, suppose that X is a. Borel subset of a Pohsh space 
(and hence a separable metric space in its own right). We will prove that continuity 
of the transition rule (f){x,u) in x & X for each u ElA is sufficient for C G <y{U), 
and we note that this is automatic if X is discrete. 

As guaranteed by separability of A", let D be a countable dense subset of X. 
Given z ^ X and e > 0, let Bz{e) denote the closed ball of radius e centered at z. 

Suppose that 0(-, u) is continuous for each u & Define the iterates ip'^ : X x 
W ^ X (s = 1, 2, . . . , t) inductively by 0^ := and 

(j)\x] Ml, ... , M^) := 0(0''"^(x; Ml, ... , M^_i); m^). 

Note that 0* is, like 0, continuous in its first argument, and that Yt{x) = 0*(x; U). 
Using the separability of A", it is not hard to show that 

C = n^Li U,eD n^eD{<l>\x; U) G 5,(l/n)}, 

from which the desired measurability of C is evident. 

Now observe that conditions (i)-(iv) and (v') in Section |6.1| are satisfied by 
fixing Xq & X arbitrarily and taking 

(6.10) {X\B') = {X,13), B' = X', B" = 13, 

X = Xo, X' = Xt, Y = Yt{x*), C as at (6.9). 

Note that the independence in (v') follows from the fact that Xq and U have been 
chosen to be independent. 

The second probability space and the algorithm: The key to setting up the 
second probability space is to satisfy assumption (x'") in Section In calculating 
the first-space conditional distribution mentioned there, we will make use of the 
auxiliary randomness provided by Xi, . . . , Xt-i and U and compute in stages. 
First observe from (|6.8|) and repeated use of (|6.5|) that 



P(Xo G dxo, . . . , Xi_i G dxt-i \ Xt = Xt) = K{xt, dxt-i) ■ ■ ■ K{xi, dxo) 
serves as a conditional distribution for {Xq, . . . , Xt-i) given Xt = Xt- Next, we will 



discuss in Section how to compute C{U \ X = x). Finally, from our assumption 
that C G o"(?7), it follows that we can write the indicator Z := Ic as Z = T{U) 
for some product-measurable T : {0, 1}, and one can check the intuitively 

obvious assertion that unit mass at r(M) serves as a conditional distribution for Z 
given {X,U) = (x, m). We get the conditional distribution in (x'") by chaining 
together the conditional distributions we have computed and integrating out the 
auxiliary variables, in the obvious and standard fashion. 
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Moreover, our discussion has indicated how to set up and simulate the second 
space. To satisfy assumption (x") in Section |6.1| , we assume that the law of 



chosen by the user is absolutely continuous with respect to tt; of course we do 
not assume that the user can compute the Radon-Nikodym derivative D. [For 
example, in the common situation that {X, B) is discrete and 7r{x) > for every 
X & X, the value of Xj can be chosen deterministically and arbitrarily.] Having 
chosen Xt = xt, the user draws an observation Xt_i = xt-i from K{xt,-), then 
an observation Xf_2 = from K{xt-i,-), etc. Next, having chosen X = x 
[i.e., (Xo,...,X() = {xq, . . . ,Xt)], the user draws an observation \J = u from 

— * — * 

C(U \ X = x). Finally, the user sets Z = r('u) and declares that C, or coalescence, 
has occurred if and only if Z = 1. With the definitions ( |6.10| ), C as above, and 



X — Xq, X' — Xt, 

assumptions (i)-(x) are routinely verified. According to Corollary |6.6| , if 
P(C) > 0, then £(Xo|C) = vr. It follows that the conditional distribution of 
output from Algorithm given that it ultimately succeeds (perhaps only after 
many iterations of the basic routine) is vr, as desired. 

Remark 6.9 (a) If P(C) > for suitably large t, then ultimate success is (a.s.) 
guaranteed if the successive choices of t become large. A necessary condition for 
ultimate positivity of P(C) is uniform ergodicity of K. This condition is also 
sufficient, in the (rather weak) sense that if K is uniformly ergodic, then there 
exists a finite integer m and a transition rule (pm for the m-step kernel K"^ such 
that Algorithm pTTI , applied using (p^, has P(C) > when t is chosen sufficiently 
large. Compare the analogous Theorem 4.2 for CFTP in Foss and Tweedie ||15|| . 
A similar remark applies to Algorithm |3.1| . 

(b) Just as discussed in Fill [^ (see especially the end of Section 7 there). 



the algorithm (including its repetition of the basic routine) we have described is 
interruptible; that is, its running time (as measured by number of Markov chain 
steps) and output are independent random variables, conditionally given that the 
algorithm eventually terminates. 

(c) If the user chooses the value of Xj (= z, say) deterministically, then all 
that can be said in general is that the algorithm works properly for vr-a.e. such 
choice. In this case, let the notation Pz(C) reflect the dependence of P(C) on the 
initial state z. Then clearly 



I P,(C)7r(d^) =P(C), 



which is the unconditional probability of coalescence in our flrst probability space 
and therefore equal to the probability that CFTP terminates over an interval of 
width t. This provides a flrst link between CFTP and Algorithm \2.1[ Very roughly 



16 



recast, the distribution of running time for CFTP is the stationary mixture, over 
initial states, of the distributions of running time for Algorithm \2.1\ . For further 



elaboration of the connection between the two algorithms, see Section 8.2 



6.3 Imputation 

In order to be able to run Algorithm ETTl the user needs to be able to impute U 



from X, i.e., to draw from C(U \ X = x). In this subsection we explain how to do 
this. 

We proceed heuristically at first: 

P{U edu\X = x) 
= P{U e du\Xo = xo, (t>{xo,Ui) = xi, . . . , (f)(xt-^i,Ut) = xt) by (pJl) 
= P{U G du \ 0(xo, Ui) = xi, . . . , 4>{xt-i, Ut) = Xt) by indep. of Xq and U 
= P{Ui G dui I 0(xo, Ui) = xi) X ■ ■ ■ X P{Ut G dut \ Ut) = Xt) 

by independence of Ui, ... ,Ut 
= P{Ui G dui I 4>{xo, Ui) = xi) X ■ ■ ■ X P{Ui G dut \ 4>{xt-i, Ui) = Xt) 

since Ui, . . . ,Ut are identically distributed 
= P{Ui G dui \ Xq = xo, Xi = xi) X ■ ■ ■ X P{Ui G dut \ Xq = Xt-i, Xi = Xt), 

where the last equality is justified in the same fashion as for the first two. 

In fact, the result of this heuristic calculation is rigorously correct; its proof is 
an elementary but not-entirely-trivial exercise in the use of conditional probability 
distributions. The existence (and a.s. uniqueness) of a conditional distribution 
C{Ui \Xq = ■, Xi = ■) is guaranteed by the fact that {U,J-') is standard Borel; 
moreover. 

Lemma 6.10 The t-fold product of the measures 

P{Ui G dui I Xo = Xo, Xi = xi), . . . , P{Ui G dut \ Xq = Xt-i, Xi = Xt) 

serves as a conditional probability distribution P{U E du\X = x). 

In setting up the second probability space, therefore, the user, having chosen 
X = X, draws an observation U = m by drawing Ui, . . . , Uj independently, with 
Us chosen according to the distribution C{Ui \ Xq = Xg^i, Xi = Xg). 



Remark 6.11 (a) There are subtleties involved in the rigorous proof of Lemma |6.10 
In particular, there is no justification apparent to us, in general, that the condi- 
tional distributions C{Ui\ 4>{xo, Ui) = ■), one for each fixed Xq E X, can be chosen 
in such a way that P{Ui G F \ 0(xo, Ui) = Xi) is jointly measurable in {xq, Xi) for 
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each F ^ T . Nevertheless, if we assume such measurabihty, then one can show 
rigorously that 

P{Ui G dui I (j){xo, Ui) =xi) X ■■■X P{Ui G dut \ (j){xt-i, Ui) = Xt) 

serves as P{U E du\X = x). 

(b) If {X, B) is discrete, then of course the measurabihty in (a) is auto- 
matic. Suppose we use the "independent-transitions" rule discussed in Re- 
mark |6.7| (c). Then the measure /i, but with the Xoth marginal replaced by S^^, 
serves as C{Ui \ 0(xo, Ui) = xi) = C{Ui \ Ui{xq) = xi) and therefore as C{Ui \ Xq = 
xq, Xi = xi). Informally stated, having chosen = Xg and Xs_i = Xg-i, the 
user imputes the forward-trajectory transitions from time s — 1 to time s in Al- 
gorithm |2.1| by declaring that the transition from state Xg-i is to state Xs and 
that the transitions from other states are chosen independently according to their 
usual non-X- conditioned distributions. 

(c) As another example, suppose that X = [0, 1] and we use the inverse proba- 
bility transform transition rule discussed in Remark |6.7| (b). Suppose also that each 
distribution function F{xo, ■) = K{xo, [0, ■]) is strictly increasing and onto [0, 1] 
and that F{xo,Xi) is jointly Borel-measurable in xq and Xi. Then Sf{xo,xi) serves 
as £(f/i \Xq = xo, Xi = xi). Informally stated, a generated pair (Xs,Xs_i) = 
{xs,Xs-i) completely determines the value F{xs-i,Xs) for U^. 



7 Monotonicity 

Throughout Section |^ we suppose that {X, B) is a Polish space with (Borel 
(T-field and) closed partial order <; the meaning of closed here is that {{x,y) G 
X X X : X < y} is assumed to be closed in the product topology. [For example, 
closedness is automatic for any partial order if {X, B) is discrete.] We also assume 
that there exist (necessarily unique) elements and 1 in (called bottom element 
and top element, respectively) such that < a; < 1 for all x G ^Y. 

We require the notion of stochastic monotonicity, according to the following 



definitions (which extend Definition 4.1 of Fill [|TT| to our more general setting). 



Definition 7.1 (a) A subset of A* is called a down-set or order ideal if, whenever 
X G -B and y < x, we have y E B. 

(b) Given two probability measures i^i and 1^2 on B, we say that z/i < z/2 
stochastically, and write z/i ^ z/2, if i^i{B) > z/2(-B) for every closed down-set B. 

(c) A kernel K is said to be stochastically monotone (SM) if K{x, ■) ^ K{y, ■) 
whenever x < y. 



The main goal of this section is to describe an analogue of Algorithm 2A 
which applies when only stochastic monotonicity of K is assumed. Here is a 
rough formulation; the details will be discussed in Section [7^ 



18 



Algorithm 7.2 Consider a stochastically monotone kernel on a partially ordered 
set with bottom element and top element 1. Choose and fix a positive integer t, 
set Xt = 0, and perform the following routine. Run the time-reversed chain K 
for t steps, obtaining Xj, Xj_i, . . . , Xq in succession. Then, reversing the direction 
of time, generate a chain, say Y = (Yq, . . . , Yt), with Yq = 1 and kernel K; this 
trajectory is to be coupled ex post facto with X = (Xq, . . . , Xf), which is regarded 
as a trajectory from K. Finally, we check whether Y^ = 0. If so, the value Xq is 



accepted as an observation from vr; if not, we repeat (as for Algorithm |2.1| , but 
always with Xt = 0). 



Remark 7.3 When X is finite. Algorithm |7.2| reduces to the algorithm of Sec- 



tion 7.2 in Fill 11 



7.1 Realizable monotonicity 



It is easy to see from ( |6.6| ) that if there exists a monotone transition rule, i.e., 
a transition rule with the property that 

(7.1) (j){x,u) < (f){y,u) for every u whenever x <y, 

then K is stochastically monotone according to Definition |7?T|(c). We call this 
stronger property realizable monotonicity . It is a common misbelief that, con- 
versely, stochastic monotonicity for K implies realizable monotonicity. This myth 
is annihilated by Fill and Machida [|I^ and Machida |]30|, even for the case of a 



finite poset X, for which it is shown that every stochastically monotone kernel is 
realizably monotone (i.e., admits a monotone transition rule) if and only if the 
cover graph of X (i.e., its Hasse diagram regarded as an undirected graph) is 
acyclic. 

Nevertheless, realizable monotonicity can be used to motivate and explain 
Algorithm |7.2|. Indeed, suppose for the remainder of this subsection that K admits 



a monotone transition rule as in (|7.1|) . Then we proceed to build a bounding 
process as in Section |5.2| and show how Algorithm |2.1| can be applied efficiently. 
One immediately verifies by induction that 

(7.2) Y,(0) < Y,(x) < Y,(i) for all < s < t and all x eX. 

Thus D, := [Y,(0),Y,(i)]^= {y e X : Y,(0) <y < Y,(i)} gives a bounding 
process, and the pair (Ys(0), Ys(l)) is a quite concise representation of D^. In 
plain language, since monotonicity is preserved, when the chains Y(0) and Y(l) 
have coalesced, so must have every Y(x). 

But note also that, if we choose the initial state Xt to be 0, then {Yt(l) = 0} is 
the coalescence event C. Algorithmically, it follows that if Xt = is a legitimate 
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starting point for Algorithm p.l| [as discussed near the end of Section |6.2| , it 
is sufficient for this that 7r({0}) > 0], and if P(C) > 0, then /:(Xo|C) = vr. 
Informally put, we need only track the single upper-bound trajectory Y(l) in 
the forward phase; if Yt(l) = 0, then the routine (correctly) accepts Xq as an 
observation from vr. 

Notwithstanding the fundamental distinction between the two notions of mono- 
tonicity, the routine of Algorithm shares with its cousin from the realizably 
monotone case the feature that, in the forward phase, the user need only check 
whether a single i^-trajectory started at 1 ends at 0. 



Remark 7.4 (a) Lower and upper bounding processes can also be constructed 
when Algorithm ^]l| is applied with a so-called "anti-monotone" transition rule; 
we omit the details. See Haggstrom and Nelander [|T9|1, Huber |l23l|, Kendall [E6 



See Haggstrom and Nelander 
M0ller M0ller and Schladitz ||3^, and Thonnes [Q for further discussion in 
various specialized settings. There are at least two neat tricks associated with 
anti- monotone rules. The first is that, by altering the natural partial order on X, 
such rules can be regarded, in certain bipartite-type settings, as monotone rules, in 
which case the analysis of Section [7.3| (with and 1 taken in the altered ordering, 
of course) is available: consult Section 3 of [O, the paper EM, and Definition 5.1 



m 



42|. The second is that the poset X is allowed to be "upwardly unbounded" 



and so need not have a 1: consult, again, and [^ 



(b) Dealing with monotone rules on partially ordered state spaces without 1 is 
problematic and requires the use of "dominating processes." We comment that a 
dominating process provides a sort of random bounding process and is useful when 
the state space is noncompact, but we shall not pursue these ideas any further 
here. See Kendall [^ and Kendall and M0ller in the context of CFTP; we 
hope to discuss the use of dominating processes for our algorithm in future work. 



7.2 Rigorous description of Algorithm 7.2 

Let K be an SM kernel with stationary distribution vr and let K be its time- 
reversal, exactly as in the paragraph containing (|6.5| ). Since we will not be using 
a transition rule, we otherwise forsake the development in Sections and 
apply afresh the general framework of Section |6.1| . For simplicity, we assume 
^({0}) > weaker conditions are possible for sufficiently regular chains — see 
Machida [RTI. 



Upward kernels: According to Theorem 1 of Kamae, Krengel, and O'Brien ^5 



Definition |7.1| (b) is equivalent to the existence of an upward kernel M [i.e., a 
Markov kernel on X such that, for all x, M{x, ■) is supported on {y ^ X : y > x}] 
satisfying i>2 = viM. Thus our assumption that K is SM implies the existence of 
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upward kernels M^y, x < y, such that 

(7.3) K{y, ■) = j K{x, dx') M^y{x' , ■) for all x<y. 

Choose and fix such kernels M^j-y] we will assume further that M^y{x', B) is jointly 
measurable in (x, x') G for each B E B. 

The Markov chain and the first probability space: We consider a probability 
space {Q, A, P) on which are defined a Markov chain X = {Xq, . . . , Xf) satisfying 

P{Xo G dxQ, . . . ,Xt E dxt) = 7r{dxo)K{xo, dxi)- ■ -K^Xt-i, dxt) 

and another process Y = {Yq, . . . ,Yt) such that 

(7.4) P{Y e dy\X = x) = 6iidyo) M^^i{xi,dyi) ■ ■ ■ M,,_,,y^^,{xt, dyt). 
Recalling the notational note at the end of Section |l], let 

(7.5) C := {Yt = 6} 

and observe that conditions (i)-(iv) and (v') in Section |6.1| are satisfied by taking 

(7.6) iX',B') = {X,B), B' = {6}ieB'), i3" = {0,5'}, 

X = Xo, X' = Xt, Y = Yt, C as at (7.5). 

Condition (v') follows from the independence of Xq and F, which in turn can be 
verified by a simple calculation using ( [7.4|) and ( [7.3|) . 

The second probability space and the algorithm: To set up the second prob- 
ability space, we compute conditional probability distributions in stages, as in 
Section lO. As there, 



P{Xq G dxQ, . . . , Xt-i G dxt-i I Xt = Xt) = K{xt, dxt-i) ■ ■ ■ K{xi, dxo) 

serves as a conditional distribution for (Xq, . . . , Xt-i) given Xt = Xt. Furthermore, 
( [7.4| ) gives a conditional distribution for Y given X. 

We now see how to set up and simulate our second space. The user sets 
:= 0, then draws an observation Xt_i = Xt-i from K{xt, ■), then an observation 
Xt_2 = Xt-2 from K{xt-i, ■), etc. Next, having chosen X = x, the user draws an 
observation Y = y from C{Y \ X = x) . [In detail, this is done by setting Yq := 1, 
then drawing Yi = yi from M^^ i{xi, ■), then Y2 = y2 from Mj;^^y^(x2, ■); stc] 
Finally, the user sets C := {Yt = 0}. With the definitions (|7.6|) , this definition 
of C, and 

X = Xo, X' = Xt, D(0) = l/7r({0}), 

assumptions (i)-(x) of Section are verified in a straightforward manner. Ac- 
cording to Corollary ^ if P(C) > 0, then i:(Xo|C) = vr. Thus Algorithm ^ 
works as claimed. 
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Remark 7.5 If the stronger assumption of realizable monotonicity holds, then Al- 



gorithm |7.2| reduces to the specialization of Algorithm p.l| discussed above. This 
follows from the fact that one can then take 



M,y{x', ■) := P(0(i/, t/i) G ■ I Ui) = x'), 



(7.7) 



for the upward kernels in (|7.3|) , provided that RHS (|7.71 ) is jointly measurable in 



7.3 Performance of Algorithm 7.2 



Using condition (x), we see that the routine in Algorithm has probability 



(7.^ 



PfC) 



PiC) K\i,{o}) 



7r({0}) 7r({0}) 



of accepting the generated value Xq. To understand this in another way, note that 
our assumption vr({0}) > implies that K\0, •) ^ 7r(-), and that the decreasing 
function x ^ K'^{x, {0})/tt{{0}) on X serves as the Radon-Nikodym derivative 
(RND) X K\6, dx)/Ti{dx). With this choice of RND, we have 

P(C)^inf^; 

^ 7r((ix) 

this last expression is a natural (though stringent) measure of agreement between 
the distribution K*((), ■) and the stationary distribution. In the discrete case, our 
performance results reduce to results found in Sections 7-8 of Fill []TT|; consult 
that paper for further discussion. 



7.4 An extension: stochastic cross-monotonicity 

There is no reason that the chains X and Y in Algorithm need have the 
same kernel. Thus, consider two (possibly different) kernels K and L satisfying 

the stochastic cross-monotonicity (or cross-SM) property 

K{x, ■) < L{y, ■) whenever x < y; 

if K = L, this reduces to the Definition [7.1| (c) of SM. We can then run Algo- 
rithm |7.2| , replacing "with Yq = 1 and kernel /T" by "with Yq = 1 and kernel L." 
The rigorous description of this algorithm is left to the reader. The analogue 
of ( [7.3|) is of course 

(7.9) L{y, ■)= f K{x, dx') M^y{x' , ■) for all x < y; 
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and if we have cross-monotone transition rules (px and (pi [according to the ap- 



propriate generahzation of (|7.1| )], then one can take 



(7.10) M^yix', ■) := P{My, Ui)e-\ (Pk{x, Ui) = x'). 

In the cross-SM case, we have the extension 

P{C) L*(i,{0}) 



PC 



7r({6}) vr({6}) 



of (^. 

One apphcation of the cross-SM case which may arise in practice (Machida 
plans to employ the following idea in a future apphcation of our algorithm to 
the mixture problems considered by Hobert et al. [^]) is that of stochastic domi- 
nance [K{x, ■) ^ L{x, ■) for all x E X]hj an SM kernel L; indeed, then K{x, ■) ^ 
L{x, ■) ^ L{y, ■) whenever x < y. Suppose further that L has stationary distribu- 
tion a with a({0}) > 0; then, with p := cT({0})/7r({0}), 

(7.11) P(c) = inf^M = pmfMMliW^ 

' ^ ' ' 7r({0}) » 'y(dy) 

using the decreasing function y L*(y,{0})/cr({0}) as the choice of RND y 
L\{6},dy)/aidy). 

The practical implication (say, for finite-state problems, as we shall assume for 
ease of discussion for the remainder of this section) is that if the user is unable to 
find a rapidly mixing SM kernel K with stationary distribution vr (the distribution 
of interest) but can find a rapidly mixing stochastically dominant SM kernel L with 
stationary distribution a "not too much larger than vr" (at {0}), then our cross- 
SM algorithm can be applied efficiently, provided the imputation of Ui inherent 
in ( |7.10|) can be done efficiently. 



To temper enthusiasm, however, we note that the limit (namely, p) as t — * oo 
of the acceptance probability ( |7. 1 1| ) can often be achieved in simpler fashion. 



Suppose, for example, that simulation from a is easy, that x i-^ 7r({a;}) and 
X (— > a{{x}) can be computed exactly except for normalizing constants, and that 
X = minimizes the ratio a{{x})/'n'{{x}). Then one can employ elementary 
rejection sampling to simulate from tt, with acceptance probability p. 

8 Relation to CFTP 
8.1 Comparison 



How does our extension of Fill's algorithm, as given by Algorithm |2.1| , compare 
to CFTP? As we see it, our algorithm has two main advantages and one main 
disadvantage. 
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Advantages: As discussed in Section |1] and Remark §]9|(b) and in [|r^ , a primary 



advantage of our Algorithms |2.1| and |3]^ is interruptibility. A related second 
advantage of Algorithm 2A concerns memory allocation. Suppose, for example, 
that our state space X is finite and that each time-step of Algorithm |2.1| , including 
the necessary imputation (recall Section |6.3| ), can be carried out using a bounded 
amount of memory. Then, for fixed t, our algorithm can be carried out using 
a fixed finite amount of memory. Unfortunately, it is rare in practice that the 
kernel K employed is sufficiently well analyzed that one knows in advance a value 
of t (and a value of the seed X^) giving a reasonably large probability P(C) of 
acceptance. Furthermore, the fixed amount of memory needed is in practice larger 
than the typical amount of memory allocated dynamically in a run of CFTP. See 
also the discussion of read-once CFTP in Section |8T 



Disadvantage: A major disadvantage of our algorithms concerns computational 
complexity. We refer the reader to []TT| and for a more detailed discussion in 



the setting of realizable monotonicity (and, more generally, of stochastic mono- 
tonicity). Briefly, if no attention is paid to memory usage, our algorithms have 
running time competitive with CFTP: cf. Remark |6.9| (c), and also the discussion 
in Remark 9.3(e) of 0] that the running time of our Algorithm 2A is, in a certain 



sense, best possible in the stochastically monotone setting. However, this analy- 
sis assumes that running time is measured in Markov chain steps; unfortunately, 
time-reversed steps can sometimes take longer than do forward steps to execute 
(e.g., |]12|)5 and the imputation described in Section ^]3| is sometimes difficult to 
carry out. Moreover, the memory usage for naive implementation of our algo- 
rithm can be exorbitant; how to trade off speed for reduction in storage needs is 
described in [^]. 

8.2 Connection with CFTP 



There is a simple connection between CFTP and our Algorithm |3]T|. Indeed, 
suppose we carry out the usual CFTP algorithm to sample from vr, using kernel K, 
transition rule 0, and driving variables U = (Uq, U_i, . . .). Let T denote the 
backwards coalescence time and let Xq ~ tt denote the terminal state output by 
CFTP. Let W ~ TT independent of U, and follow the trajectory from X_t '■= W 
to Xq; call this trajectory X = (X_t, . . . , Xq). Since Xq is determined solely 
by U, the random variables W and Xq are independent. 



When TT = vr in Algorithm |3.1| , the algorithm simply constructs the same 
probability space as for CFTP, but with the ingredients generated in a different 
chronological order: first Xo,X_i, . . .; then U (which determines T); then W := 
X_T- Again Xq ~ tt and W ~ tt are independent. 



Note that a fundamental difference between Algorithm 3.1 and CFTP is in 



what values they report. CFTP reports Xq as its observation from vr, while 
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Algorithm |3.1| reports the value W = X_t- 



Remark 8.1 (a) Because of this statistical independence, it does not matter in 
Algorithm |0| that we actually use Xq ~ tt 7^ tt. 

(b) The fact (1) that W, unlike Xq, is independent of U, together with (2) that 
T depends solely on U, explains why our algorithm is interruptible and CFTP is 
not. 

(c) In a single run of CFTP, the user would of course be unable to choose 



W ~ TT as above, just as in a single run of Algorithm |3.1| we do not actually 
choose Xq ~ tt. So one might regard our described connection between the two 
algorithms as a bit metaphorical. But see Section |9.2|. 



Read-once CFTP: We note also that our Algorithm p. 1| bears close resemblance 



to the read-once CFTP algorithm of Wilson Indeed, that algorithm shares 
with ours the property of requiring only a bounded amount of memory. Further- 
more, in the execution of Algorithm PTTI, consider relabeling time as follows. At 



the k attempt at coalescence {k = 1,2,...), consider X_(fc_i)f, . . . , X_fct in place 
of Xt, . . . ,Xo, respectively, where now X_fct plays two roles: the candidate for 



output in attempt k and the starting state in attempt k + 1. Then Algorithm ^TT 
will output X_fct if and only if coalescence occurs in attempt k. Read-once CFTP, 
on the other hand, runs chains forwards in time from time 0, outputting X^j if 
and only if coalescence occurs beginning at time kt and ending at time {k + l)t 
and has also occurred on some earlier such interval. Thus, in some sense, the 
two algorithms are mirrors of one another. However, the read-once algorithm is 
not interruptible, because its output (which necessarily follows a previous coales- 
cence), conditional on rapid termination, is biased towards values resulting from 
fast first coalescence. By contrast, the output of Algorithm 12]^ does not require 
computation of a previous coalescence. 



9 Perfect samples of arbitrary size 

Thus far in this paper we have considered only the problem of obtaining a 
single observation from the distribution tt of interest. What can be done to obtain 
a perfect sample of size ra? 

9.1 Elementary options 



As discussed in Section 3 of Fill |Tl]] and in Murdoch and Rosenthal [p6| , 
the simplest option is to apply, repeatedly and independently, an algorithm pro- 
ducing a perfect sample of size 1. When the size-1 algorithm employed is inter- 



ruptible (e.g.. Algorithm |2?T| or Algorithm |3.1|) , the resulting size-n algorithm is 
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both observationwise interruptible [in the sense that, for k = l,...,n, the kth 
observation output, say W^, and the number of Markov chain steps required, 
say Tfc, are conditionally independent given all randomness used in the genera- 
tion of Wi, . . . , Wfc_i] and totally interruptible [in the sense that T+ := J2k=i Tfc 
and W := (Wi,...,W„) are independent]. A related observation is that the 
conditional distribution of a fixed- duration sample given its size is that of an i.i.d. 
sample, again provided that time is measured in Markov chain steps; contrast 



comment 3 concerning CFTP in Remark 5.3 of |Tl[. 

Another simple option is to generate a single observation, say Wi, from a size-1 
perfect sampler and then run the chain K (or K, or any other chain with stationary 
distribution vr) for n — 1 steps from Wi, obtaining W = (Wi, . . . , W„). Note 
that each observation is marginally distributed as tt. This size-n algorithm is 
also observationwise interruptible and totally interruptible. Of course, statistical 
use of W is complicated by the serial dependence of its entries. 

A third option is to compromise between the first two ideas and generate 
u independent vectors W*^^\ . . . , W*^'^^ where W*^*-* is obtained using the size-tj 
sampler described in the preceding paragraph and u and (ti, . . . ,t^) are chosen 
(in advance, deterministically) so that ti + ■ ■ - + 1^ = n. Again we have both forms 
of interruptibility. This third option is an interruptible analogue of the "RCFTP 
(Repeated CFTP) tours" of Murdoch and Rosenthal [^ . 



9.2 Efficient use of size-1 perfect samplers 

All three of the options in Section |9.1| seem wasteful. After all, a great deal of 



randomness and computational effort goes into the use of Algorithm 2.1 or 3.1, yet 



only a single observation from vr comes out. Is there a way to be more efficient? 



Our short answer is this: Yes, but only for Algorithm |3.1| , and then the advan- 
tage of interruptibility is forfeited. The remainder of this subsection provides an 
explanation. 

Using Algorithm Suppose that we have in hand an observation Wi ~ vr, 
for example from a first run of Algorithm p. 1| . If we feed Wi into the routine 
of Algorithm ^?T] as X^, then the resulting X is unconditionally distributed as a 
stationary trajectory from K; in particular, X^ ~ tt for < s < t. Unfortunately, 
this is not generally true conditionally given success (i.e., given coalescence): 

Example 9.1 Consider again the toy random- walk example of Section ^, again 
with t = 2 but now with the modifications 

A;(0,0) = fc(2,2) = 3/4, A;(0, 1) = A;(2, 1) = 1/4, A;(l, 0) = fc(l, 2) = 1/2, 
k{0,2) = k{l,l) = k{2,0) = 

and TT = (2/5, 1/5,2/5) to K and vr. Suppose that the seed value X2 is chosen 
according to tt, and that the independent-transitions rule is used. Then of course 
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£(Xo|C) = 71, but one can check that £(Xi|C) 
£(X2|C) = (4/11,3/11,4/11). 



(7/22, 8/22, 7/22) and that 



This example illustrates a rather catastrophic fact about Algorithm |2.1| : If 
we use the output W2 := Xq as an observation from vr, we may no longer use 
Wi = Xf as such. We know of no systematic way to make use of the auxiliary 
randomness generated in a run of Algorithm p.l| . 



Using Algorithm |3.1| .' Let Algorithm |3.1| ^ denote Algorithm modified as 
follows. For fixed to, use T' := max(T,to — 1) in place of T. Algorithm 3A' uses 
a conservative detection rule (see section ^), and so is still valid. 

Suppose again that we have in hand an observation distributed as vr, say from 
a first run of Algorithm |3.1f or |3.1| or p.l| . If we call this observation ^il^ and feed 

is 



it into Algorithm p. If as Xq, then the resulting trajectory ( 
distributed as a stationary trajectory from K; in particular. 



X_2i X_i , X, 



-1) 



w(^) = (Wj 



(1) 



w 



to y 



fx_ 



(to-1), • • • , Xq 



(2) 

is a stationary trajectory of length to from K. Defining Wj^ := X_t, we now 
feed wl^"* into Algorithm p.lf as the new seed Xq; using randomness otherwise 
independent of the first run, we obtain another stationary trajectory, call it W*^^\ 
of length to from K. Taking W^^^ to be the value of X_t from this second run, 
we then provide W^j^'* as a seed producing W^'^^ and so on. 

It is not hard to see that, for any z/ > 1, the joint distribution of the "tours" 
(as Murdoch and Rosenthal |3^ call them) W^*^-*, . . . , W^^-*, W*-^-* is the same as 
the joint distribution of u serially generated "Guarantee Time CFTP" (GTCFTP) 
tours as in Section 5 of [^. (Their "guarantee time" Tg is our to — 1.) Since each 
tour W^*) has the marginal distribution 



(9.1) 



P(W^*) G dtv) = 7i{dwi) K{wi, dw2) ■ ■ ■ K{wto-i, dw. 



to/ 



and since both (W*^-^-*, W*^^-*, . . . , W'^'^^) and the sequence of GTCFTP tours are 
clearly tour-valued Markov chains, we see that our tour-chain is simply the sta- 
tionary time-reversal of the (stationary) GTCFTP tour-chain, where the common 
stationary tour-distribution is (|0|). 

Since GTCFTP tours are analyzed by Murdoch and Rosenthal [0, we refer 
the interested reader to ^6| for further discussion and analysis. One highlight is 
that while the tours W*^^\ W*^^-*, . . . are not independent, they are 1-dependent: 
that is, W(^) and (W^^), W^^), . . . , w^i-^\W^'+^\W^'+^\ . . .) are independent for 
every i. 



Remark 9.2 For to = 1, one can check that the tour-algorithm we have described is 
an observationwise interruptible and totally interruptible algorithm for producing 
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independent observations from vr. Unfortunately, for to > 2, one can check that 
all interruptibility is lost, due to the dependence of (X_(t(,_i), . . . ,X_i) and X_t 



(unlike the independence of Xq and X_t) in Algorithm 3.1 



In light of the above remark, and the fact that CFTP tours are generated 
considerably more easily than are our Algorithm |3.1f tours, we see nothing to 



recommend the use of Algorithm in practice. (We remark, nonetheless, that 
it was in investigating our tours that we discovered the connection between Algo- 
rithm B.ll and CFTP described in Section 18.21.) 
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